Agenda

  • Me
  • RStudio
  • Setup:
    • How I do Python install and package management
    • Setting up RStudio for Python use

  • reticulate
    • Environments
    • Interacting with Python from R
    • Type conversions
  • Live demo?

About Me

Who is Jennifer Knack?

Quick aside: RStudio

Disclaimer

R is my preferred coding language. All code in this talk, unless specified, is R code.

I use RStudio as my preferred IDE, and this talk was written from that perspective.

You can just as easily do this all in VSCode. But please…

Don’t use Jupyter Notebooks.

  • Hidden states hinder reproducibility
    • Cells can be run in any order, any number of times. According to one study [1], 73% of all Jupyter notebooks are not reproducible with straightforward approaches.
  • Version control is painful because Jupyter notebooks are stored as JSON files that interleave code, outputs, and metadata
    • Merging branches with git nearly always breaks because git works on a line-by-line basis and knows nothing about the JSON structure
    • Visualizing and understanding differences between versions is an impossible task
  • No linting, no style help, no IDE integration
  • Test-driven development is very difficult
  • Jupyter notebooks don’t scale well with big data

RStudio is objectively the best IDE

  • easy GUI with console and terminal windows
  • easy, integrated package management (for R)
  • easy, integrated report/presentation generation via Rmarkdown or Quarto
  • git integration
  • easily view environment objects and run history
  • linting, syntax help, and tab-completion all built in
  • easily navigate and view files and set working directory
  • plot view and help windows

“But why not just use VSCode?”

Because I don’t want to.

I like (and more importantly, know) RStudio.

Setup

Python install and package management

There are utilities to install and manage Python packages and environments in RStudio [2], but I don’t use them.

I use mamba [3] instead.

Setting up RStudio to run Python

  1. Tell RStudio where you installed Python

Go to Tools -> Global Options

Enter the path to the python interpreter in your base environment

  2. Install the reticulate package
install.packages("reticulate")

reticulate [4] is an R package from Posit (formerly RStudio) containing tools for interoperability between Python and R.

reticulate

What does reticulate do? [5]

  1. Allows calling Python from R in multiple ways including from RMarkdown/Quarto, sourcing Python scripts, importing Python modules, and using Python interactively within an R session
  2. Translates between R and Python objects (e.g., between R and Pandas data frames, or between R matrices and NumPy arrays)
  3. Provides flexible binding to different versions of Python including virtual environments and Conda environments

A note from Posit about the philosophy behind Python tools in RStudio

These tools are not intended for standalone Python work but rather explicitly aimed at the integration of Python into R projects (and as such are closely tied to the reticulate package).

They “strongly suggest” using one of the IDEs available for doing data science in Python for Python-only projects [6].

Set your environment

The first thing you always want to do is set your environment:

library(reticulate)
use_condaenv("madpy")

Do this whether you’re working interactively, calling scripts, or authoring reports.

There is also reticulate::use_python(), which lets you specify a version of Python other than the one you set in your global options, and reticulate::use_virtualenv(), which selects a virtual environment instead of a conda environment.

The py object

When you call library(reticulate), it creates the py object in the reticulate package environment.

It is the bridge between R and Python, through which you can run Python code and interact with Python objects.

The most common way you will interact with it is to access any Python object from the R environment using the $ operator, e.g., py$x.

Important

Always call library(reticulate) or you won’t be able to access the py object!

The r object

Similarly, reticulate creates the r object in the Python environment it creates. Through it, you can access R objects in the Python environment.

You can access R objects using the . operator, e.g., r.x.

Note

Examples of using the py and r objects appear later in this presentation.

Importing modules

reticulate::import() can be used to import any installed Python module into your R environment.

os <- import("os")

Then you can call any function from that module in R using $.

os$listdir(".") |> head()
[1] ".git"          ".gitignore"    ".Rbuildignore" ".RData"       
[5] ".Rhistory"     ".Rproj.user"  

If you’d like to access built-in Python functions, use reticulate::import_builtins().

builtins <- import_builtins()
builtins$print('Hello, World!')
Hello, World!

Sourcing scripts

Let’s say I have a Python script that defines a function:

## this is Python code

def add(x,y):
  return x + y

If I’d like to use that function in R, I can source it using reticulate::source_python().

source_python('add.py')
add(5,10)
[1] 15

Executing code

Let’s say my collaborator wrote a Python script for processing some raw data. I’d like to work with the processed data in R, but my collaborator only provided me with the raw data and the script.

I know the script requires a variable named file, a character string giving the path to the raw data, and outputs a Pandas data frame called df.

I can use reticulate::py_run_string() and reticulate::py_run_file() to process the data, and then access any objects created in the Python main module using the py object exported by reticulate:

# Set the Python variable pointing to the raw data file
py_run_string("file = 'extdata/rawdata.csv'")

# run the processing script, which takes the file argument
py_run_file("process_raw_data.py")

# access the resulting df
py$df
  a b
1 1 4
2 2 5
3 3 6
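
The pattern above works because both calls run in one shared Python namespace. The following is a minimal pure-Python sketch of that idea, not reticulate's implementation: the script body, the open_raw helper, and the in-memory CSV text are hypothetical stand-ins, and a plain dict of columns takes the place of the Pandas data frame.

```python
import csv
import io

# Hypothetical stand-in for the contents of process_raw_data.py.
# It expects a global named `file` to exist already, and it leaves
# a global named `df` behind (a dict of columns here, not a real DataFrame).
SCRIPT = """
rows = list(csv.DictReader(open_raw(file)))
df = {name: [int(row[name]) for row in rows] for name in rows[0]}
"""

# Hypothetical raw data, kept in memory so the sketch needs no files on disk.
RAW_FILES = {"extdata/rawdata.csv": "a,b\n1,4\n2,5\n3,6\n"}


def open_raw(path):
    """Return a file-like object for one of the in-memory raw files."""
    return io.StringIO(RAW_FILES[path])


# One shared namespace plays the role of Python's main module.
main = {"csv": csv, "open_raw": open_raw}

# Step 1: define the variable the script needs (the py_run_string() step).
exec("file = 'extdata/rawdata.csv'", main)

# Step 2: run the script in the same namespace (the py_run_file() step).
exec(SCRIPT, main)

# Step 3: read the result back out (the py$df step).
df = main["df"]
print(df)
```

Because the script only sees what is already in the namespace, forgetting step 1 would fail with a NameError on file, which is the same failure mode you would hit from R.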

Working in RMarkdown [7]

reticulate includes a Python engine for RMarkdown, and knitr v1.18 (2017) and higher use this engine by default.

Set your environment in your setup chunk, using the same library(reticulate) and use_condaenv() calls shown earlier.

Important

Always call library(reticulate) or you won’t be able to access the py object!

Then you can start inserting Python chunks just like you would R chunks, and knitr will knit everything together.

Just like when working interactively, you can access objects created in Python chunks in R by using the py object, e.g., py$x.

And you can access objects created in R chunks in Python by using the r object, e.g., r.x.

Working in Quarto

Quarto provides all the support for Python that RMarkdown does, plus support for Jupyter:

  • Quarto supports rendering with the Jupyter kernel in addition to knitr and reticulate – just put jupyter: python3 in your YAML header and make sure the paths to Python and Jupyter are in your PATH.
  • You can also provide a full kernelspec in your YAML:
---
title: "My Document"
jupyter:
  kernelspec:
    name: xpython
    language: "python"
    display_name: "Python 3.7 (XPython)"
---
  • Because it can use the Jupyter kernel, Quarto CLI can render Jupyter notebooks too:
quarto render document.ipynb

There is other support for Python from the Quarto CLI as well, plus a VSCode Quarto plugin [8].

Working with Python interactively

If you want to work with Python interactively, you can call reticulate::repl_python() to initiate a Python REPL embedded in the R console.

You can use the py and r objects to access objects between environments.

Type conversions

When calling into Python, R data types are automatically converted to their equivalent Python types.

When values are returned from Python to R they are converted back to R types [9].

The automatic conversion between R types and Python types works well in most cases, but sometimes you might want more control over the conversions.

Conversion table

R                       Python              Examples
Single-element vector   Scalar              1, 1L, TRUE, "foo"
Multi-element vector    List                c(1.0, 2.0, 3.0), c(1L, 2L, 3L)
List of multiple types  Tuple               list(1L, TRUE, "foo")
Named list              Dict                list(a = 1L, b = 2.0), dict(x = x_data)
Matrix/Array            NumPy ndarray       matrix(c(1,2,3,4), nrow = 2, ncol = 2)
Data Frame              Pandas DataFrame    data.frame(x = c(1,2,3), y = c("a", "b", "c"))
Function                Python function     function(x) x + 1
Raw                     Python bytearray    as.raw(c(1:10))
NULL, TRUE, FALSE       None, True, False   NULL, TRUE, FALSE

Controlling when conversion happens

If you’d like to work directly with Python objects by default you can pass convert = FALSE to the reticulate::import() function.

# import numpy and specify no automatic Python to R conversion
np <- import("numpy", convert = FALSE)

# do some array manipulations with NumPy
a <- np$array(c(1:4))
(sum_np <- a$cumsum())
array([ 1,  3,  6, 10])
# what is sum_np?
class(sum_np)
[1] "numpy.ndarray"         "python.builtin.object"

Then when you’re done working with the object in Python, you can convert it to an R object explicitly with reticulate::py_to_r().

# convert to R explicitly at the end and print object
(sum_r <- py_to_r(sum_np))
[1]  1  3  6 10
# what is sum_r?
class(sum_r)
[1] "array"

Note

We’ll be using both the sum_np and sum_r objects in examples later on, so take note of the difference between them.

Defining the conversion

Numeric types are different between R and Python. For example, the literal 42 is a double (a floating-point number) in R, while in Python it’s an integer.

If you want to explicitly define a number as an integer in R so that it’s passed as such to Python, use the L suffix:

class(42)
[1] "numeric"
class(42L)
[1] "integer"

If a Python API requires a list but you’re only passing it a single element, you can wrap it in base list():

foo$bar(indexes = list(42L))

Similarly, if the Python API wants a tuple, you can use reticulate::tuple():

tuple("a", 5.5, FALSE)

And if the Python API wants a dictionary, you can use reticulate::dict():

dict(foo = "bar", index = 42L)
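
To see why these explicit conversions matter, here is a small Python sketch of the kind of API the text describes. The function bar and its checks are hypothetical, written only to illustrate a strict Python signature; they are not part of any real library.

```python
def bar(indexes):
    """Hypothetical strict API: accepts only a list of integers."""
    if not isinstance(indexes, list):
        raise TypeError("indexes must be a list, got " + type(indexes).__name__)
    if not all(isinstance(i, int) for i in indexes):
        raise TypeError("every index must be an int")
    return [i * 2 for i in indexes]


# What R's list(42L) arrives as: a one-element list holding an int.
print(bar([42]))

# What a bare 42 from R arrives as: a float scalar, which this API rejects.
try:
    bar(42.0)
except TypeError as err:
    print("rejected:", err)

# tuple() and dict() on the R side correspond to these Python literals.
as_tuple = ("a", 5.5, False)
as_dict = {"foo": "bar", "index": 42}
print(type(as_tuple).__name__, type(as_dict).__name__)
```

Many Python libraries are more forgiving than this, but when one is not, the L suffix and the list(), tuple(), and dict() wrappers are how you control what it receives.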

Indices

Python uses 0-based indices for collections:

sum_np[0L]
np.int64(1)

while R uses 1-based indices:

sum_r[1]
[1] 1

Note

Notice the need to explicitly use an integer when indexing the Python object.

Python slices exclude the end of the range, while R ranges include it:

sum_np[2L:4L]
array([ 6, 10])
sum_r[2:4]
[1]  3  6 10

And negative indexing in Python counts from the end of the container, while in R it removes that index:

sum_np[-1L]
np.int64(10)
sum_r[-1]
[1]  3  6 10
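
For reference, the Python half of these comparisons can be reproduced with a plain list, no NumPy required. This is a small sketch using the same four values as sum_np; the variable names are just for illustration.

```python
values = [1, 3, 6, 10]  # same numbers as the cumulative sum above

first = values[0]      # 0-based: index 0 is the first element
middle = values[2:4]   # end-exclusive: positions 2 and 3 only
last = values[-1]      # negative index counts back from the end

print(first, middle, last)
```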

Arrays

Wait, do you mean vectors or matrices?

NumPy uses one type of object, the ndarray, for an indexed collection of numbers with any number of dimensions.

sum_np
array([ 1,  3,  6, 10])
class(sum_np)
[1] "numpy.ndarray"         "python.builtin.object"

R has multiple types of objects for this.

The R array is analogous to the NumPy ndarray and can also handle any number of dimensions.

sum_r
[1]  1  3  6 10
class(sum_r)
[1] "array"

More commonly, folks will use a numeric vector for 1-dimensional indexed collections of numbers.

rnorm(4)
[1]  0.3373614 -2.9059156 -2.2847108  0.5909855
class(rnorm(4))
[1] "numeric"

These will be coerced to arrays by R automatically whenever they interact with other arrays mathematically.

A matrix is a special case of an array with exactly two dimensions.

matrix(rnorm(4),2,2)
            [,1]       [,2]
[1,]  0.71124966 -0.7040865
[2,] -0.06670778 -0.6020136
class(matrix(rnorm(4),2,2))
[1] "matrix" "array" 

The TL;DR [10]

R and Python represent arrays with two or more dimensions differently in memory:

  • R only supports column-major order (FORTRAN-style)
  • NumPy supports both column-major and row-major (C-style) order, but defaults to row-major

The most important thing to remember about this is that

R and Python print arrays differently.

Column- vs. Row-Major Order

Modified from image by Cmglee - Own work, CC BY-SA 4.0, https://commons.wikimedia.org/w/index.php?curid=65107030

Reshaping arrays

The R dim() function is used to reshape arrays in R. This works by changing the dim attribute of the array, effectively re-interpreting the array indices using column-major semantics.

Remember though that NumPy uses row-major semantics by default; hence, its reshape method uses row-major semantics.

# NumPy reshape uses row-major semantics
np$reshape(np$arange(1,5), c(2L,2L))
array([[1., 2.],
       [3., 4.]])

So if you’re mixing R and Python code, you may get inconsistent results.

To overcome this, use reticulate::array_reshape() to reshape R arrays using row-major semantics.

# make an array from a vector
# of 4 elements:
u <- 1:4

# dim() uses
# column-major semantics
dim(u) <- c(2,2)
u
     [,1] [,2]
[1,]    1    3
[2,]    2    4
# array_reshape() uses
# row-major semantics
array_reshape(1:4, c(2,2))
     [,1] [,2]
[1,]    1    2
[2,]    3    4
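
The difference between the two reshape conventions can also be sketched in plain Python. The function below is a hypothetical illustration, not how dim() or array_reshape() are implemented: it fills a 2-D shape from a flat sequence in either order and reproduces the two matrices shown above.

```python
def reshape_2d(flat, nrow, ncol, order="C"):
    """Fill an nrow x ncol nested list from a flat sequence.

    order="C" fills row by row (row-major, like NumPy's default);
    order="F" fills column by column (column-major, like R's dim<-).
    """
    if len(flat) != nrow * ncol:
        raise ValueError("shape does not match number of elements")
    if order == "C":
        return [[flat[i * ncol + j] for j in range(ncol)] for i in range(nrow)]
    if order == "F":
        return [[flat[j * nrow + i] for j in range(ncol)] for i in range(nrow)]
    raise ValueError("order must be 'C' or 'F'")


flat = [1, 2, 3, 4]
row_major = reshape_2d(flat, 2, 2, order="C")  # matches array_reshape()
col_major = reshape_2d(flat, 2, 2, order="F")  # matches dim(u) <- c(2,2)
print(row_major)
print(col_major)
```

The only difference between the two branches is which index moves fastest through the flat data, which is exactly the row-major versus column-major distinction.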

Grouping while printing 3+D arrays

These are the exact same array:

(x <- np$arange(1, 9)$reshape(2L, 2L, 2L))
array([[[1., 2.],
        [3., 4.]],

       [[5., 6.],
        [7., 8.]]])
(y <- py_to_r(x))
, , 1

     [,1] [,2]
[1,]    1    3
[2,]    5    7

, , 2

     [,1] [,2]
[1,]    2    4
[2,]    6    8

These are the exact same array, so why do they look different?

Python groups by the first index when printing, while R groups by the last index:

x
array([[[1., 2.],
        [3., 4.]],

       [[5., 6.],
        [7., 8.]]])
x[0L,,]
array([[1., 2.],
       [3., 4.]])
x[,,0L]
array([[1., 3.],
       [5., 7.]])
y
, , 1

     [,1] [,2]
[1,]    1    3
[2,]    5    7

, , 2

     [,1] [,2]
[1,]    2    4
[2,]    6    8
y[1,,]
     [,1] [,2]
[1,]    1    2
[2,]    3    4
y[,,1]
     [,1] [,2]
[1,]    1    3
[2,]    5    7
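
The same slicing logic can be checked with nested Python lists. This sketch rebuilds the 2 x 2 x 2 array by hand (the variable names are illustrative) and takes one slice along the first axis and one along the last, matching the x[0L,,] and x[,,0L] results above.

```python
# The 2 x 2 x 2 array from above as nested lists: x[i][j][k].
x = [[[1, 2], [3, 4]],
     [[5, 6], [7, 8]]]

# Fixing the first index gives the block Python prints first.
first_axis_slice = x[0]

# Fixing the last index gives the block R prints first (", , 1").
last_axis_slice = [[row[0] for row in plane] for plane in x]

print(first_axis_slice)
print(last_axis_slice)
```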

What about arrays from R to Python?

In the previous example I created an array in Python and ported it to R. What about the other way around?

(v <- array(1:24, c(4, 3, 2)))
, , 1

     [,1] [,2] [,3]
[1,]    1    5    9
[2,]    2    6   10
[3,]    3    7   11
[4,]    4    8   12

, , 2

     [,1] [,2] [,3]
[1,]   13   17   21
[2,]   14   18   22
[3,]   15   19   23
[4,]   16   20   24
(w <- np$array(v))
array([[[ 1, 13],
        [ 5, 17],
        [ 9, 21]],

       [[ 2, 14],
        [ 6, 18],
        [10, 22]],

       [[ 3, 15],
        [ 7, 19],
        [11, 23]],

       [[ 4, 16],
        [ 8, 20],
        [12, 24]]], dtype=int32)

The NumPy array will be created using column-major ordering:

w$flags
  C_CONTIGUOUS : False
  F_CONTIGUOUS : True
  OWNDATA : True
  WRITEABLE : True
  ALIGNED : True
  WRITEBACKIFCOPY : False

Remember:

  • F for “FORTRAN” (column-major order)
  • C for “C” (row-major order)

You can always have NumPy fill an array in column-major order by passing "F" as the order argument:

np$reshape(np$arange(1, 25), c(4L, 3L, 2L))
array([[[ 1.,  2.],
        [ 3.,  4.],
        [ 5.,  6.]],

       [[ 7.,  8.],
        [ 9., 10.],
        [11., 12.]],

       [[13., 14.],
        [15., 16.],
        [17., 18.]],

       [[19., 20.],
        [21., 22.],
        [23., 24.]]])
np$reshape(np$arange(1, 25), c(4L, 3L, 2L), "F")
array([[[ 1., 13.],
        [ 5., 17.],
        [ 9., 21.]],

       [[ 2., 14.],
        [ 6., 18.],
        [10., 22.]],

       [[ 3., 15.],
        [ 7., 19.],
        [11., 23.]],

       [[ 4., 16.],
        [ 8., 20.],
        [12., 24.]]])

You can rearrange R arrays into row-major order, but it’s gross.

More array considerations

  • Dense R arrays are presented to Python as column-major NumPy arrays (FORTRAN-style).
  • All NumPy arrays (column-major, row-major, or otherwise) are presented to R as column-major arrays, since that’s all R can understand.
  • R arrays are only copied to Python when they need to be; otherwise, data are shared.
  • NumPy arrays are always copied when moved into R arrays. This can sometimes lead to multiple copies of any one array in memory at one time.

Sparse matrices

reticulate supports the conversion of sparse matrices created by the Matrix R package to and from SciPy CSC matrices [11].

I tried to make an example but working out the dependencies for scipy.sparse was way too much work.

https://rstudio.github.io/reticulate/articles/python_dependencies.html may have been helpful but I didn’t care enough.
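
For readers unfamiliar with the format, compressed sparse column (CSC) storage keeps only the non-zero values plus two index arrays. Here is a minimal pure-Python sketch of that layout, assuming a small hypothetical matrix; it does not use SciPy or the Matrix package, and the function name to_csc is made up for illustration.

```python
def to_csc(dense):
    """Convert a dense nested-list matrix to CSC arrays (data, indices, indptr)."""
    nrow, ncol = len(dense), len(dense[0])
    data, indices, indptr = [], [], [0]
    for j in range(ncol):                  # walk column by column
        for i in range(nrow):
            if dense[i][j] != 0:
                data.append(dense[i][j])   # the non-zero value
                indices.append(i)          # its 0-based row index
        indptr.append(len(data))           # where the next column starts
    return data, indices, indptr


dense = [[1, 0, 0],
         [0, 0, 3],
         [2, 0, 4]]
data, indices, indptr = to_csc(dense)
print(data, indices, indptr)
```

Column-wise storage fits naturally with R's column-major view of the world, which is one reason CSC is the format that converts cleanly in both directions.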

Data Frames

The important points

As mentioned earlier, R data frames can be automatically converted to and from Pandas data frames. By default, columns are converted using the same rules governing R array <=> NumPy array conversion, with a couple of extensions:

  • R factors <=> Python categorical variables
  • R POSIXt times <=> NumPy array with dtype=datetime64[ns]

The important points

If the R data frame has row names, the generated Pandas data frame will be re-indexed using those row names, and vice versa.

If a Pandas data frame has a DatetimeIndex, it is converted to character vectors as R only supports character row names.

Using Pandas nullable data types

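Out of the box, Pandas handles NAs differently than R. In the conversion below, the integer NA surfaces as -2147483648 (the smallest 32-bit integer, which R uses internally for NA_integer_) and the logical NA becomes False: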

(df <- data.frame(
  int = c(NA, 1:4),
  num = c(NA, rnorm(4)),
  lgl = c(NA, rep(c(TRUE, FALSE), 2)),
  string = c(NA, letters[1:4])
))
  int        num   lgl string
1  NA         NA    NA   <NA>
2   1  0.5235702  TRUE      a
3   2 -1.1265969 FALSE      b
4   3 -1.5273399  TRUE      c
5   4 -1.8577981 FALSE      d
r_to_py(df)
          int       num    lgl string
0 -2147483648       NaN  False   None
1           1  0.523570   True      a
2           2 -1.126597  False      b
3           3 -1.527340   True      c
4           4 -1.857798  False      d

Pandas has experimental support for nullable data types (represented by pd.NA), but you have to enable it first:

# tell reticulate to use Pandas nullable data types
options(reticulate.pandas_use_nullable_dtypes = TRUE)

r_to_py(df)
    int       num    lgl string
0  <NA>      <NA>   <NA>   <NA>
1     1   0.52357   True      a
2     2 -1.126597  False      b
3     3  -1.52734   True      c
4     4 -1.857798  False      d

Final Points

For advanced Python users

For advanced Python users, there’s more documentation on

  • Contexts
  • Iterators
  • Functions
  • Creating high-level R interfaces for Python libraries

Check out https://rstudio.github.io/reticulate/articles/calling_python.html for deets.

Access to Python help from R

You can print documentation on any Python object using reticulate::py_help():

py_help(os$chdir)

This will open a text document outside of RStudio:

Help on built-in function chdir in module nt:

chdir(path)
    Change the current working directory to the specified path.

    path may always be specified as a string.
    On some platforms, path may also be specified as an open file descriptor.
      If this functionality is unavailable, using it raises an exception.

Some more reading

There is also this University of Chicago article aimed at Python users who are new to R, as well as an excellent article aimed at R users who are new to Python.

Your favorite search engine can tell you more too.

And I’m sure one of you will mention

rpy2, basically reticulate for the Python environment.

Documentation is not as robust as for reticulate, though.

Session info

R

sessionInfo()
R version 4.4.1 (2024-06-14 ucrt)
Platform: x86_64-w64-mingw32/x64
Running under: Windows 10 x64 (build 19045)

Matrix products: default


locale:
[1] LC_COLLATE=English_United States.utf8 
[2] LC_CTYPE=English_United States.utf8   
[3] LC_MONETARY=English_United States.utf8
[4] LC_NUMERIC=C                          
[5] LC_TIME=English_United States.utf8    

time zone: America/Chicago
tzcode source: internal

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] reticulate_1.38.0

loaded via a namespace (and not attached):
 [1] digest_0.6.36     fastmap_1.2.0     xfun_0.46         Matrix_1.7-0     
 [5] lattice_0.22-6    rappdirs_0.3.3    knitr_1.48        htmltools_0.5.8.1
 [9] png_0.1-8         rmarkdown_2.27    cli_3.6.3         grid_4.4.1       
[13] withr_3.0.1       compiler_4.4.1    rstudioapi_0.16.0 tools_4.4.1      
[17] evaluate_0.24.0   Rcpp_1.0.13       yaml_2.3.10       rlang_1.1.4      
[21] jsonlite_1.8.8   

Python

py_config()
python:         C:/Users/jenna/.conda/envs/madpy/python.exe
libpython:      C:/Users/jenna/.conda/envs/madpy/python312.dll
pythonhome:     C:/Users/jenna/.conda/envs/madpy
version:        3.12.4 | packaged by conda-forge | (main, Jun 17 2024, 10:04:44) [MSC v.1940 64 bit (AMD64)]
Architecture:   64bit
numpy:          C:/Users/jenna/.conda/envs/madpy/Lib/site-packages/numpy
numpy_version:  2.0.1
os:             froze

NOTE: Python version was forced by use_python() function

py_list_packages("iffrug")
           package      version                requirement     channel
1            bzip2        1.0.8                bzip2=1.0.8 conda-forge
2  ca-certificates     2024.7.4   ca-certificates=2024.7.4 conda-forge
3     intel-openmp     2024.2.0      intel-openmp=2024.2.0 conda-forge
4          libblas        3.9.0              libblas=3.9.0 conda-forge
5         libcblas        3.9.0             libcblas=3.9.0 conda-forge
6         libexpat        2.6.2             libexpat=2.6.2 conda-forge
7           libffi        3.4.2               libffi=3.4.2 conda-forge
8         libhwloc       2.11.1            libhwloc=2.11.1 conda-forge
9         libiconv         1.17              libiconv=1.17 conda-forge
10       liblapack        3.9.0            liblapack=3.9.0 conda-forge
11       libsqlite       3.46.0           libsqlite=3.46.0 conda-forge
12         libxml2       2.12.7             libxml2=2.12.7 conda-forge
13         libzlib        1.3.1              libzlib=1.3.1 conda-forge
14             mkl     2024.1.0               mkl=2024.1.0 conda-forge
15           numpy        2.0.1                numpy=2.0.1 conda-forge
16         openssl        3.3.1              openssl=3.3.1 conda-forge
17          pandas        2.2.2               pandas=2.2.2 conda-forge
18             pip       24.1.2                 pip=24.1.2 conda-forge
19  pthreads-win32        2.9.1       pthreads-win32=2.9.1 conda-forge
20          python       3.12.4              python=3.12.4 conda-forge
21 python-dateutil        2.9.0      python-dateutil=2.9.0 conda-forge
22   python-tzdata       2024.1       python-tzdata=2024.1 conda-forge
23      python_abi         3.12            python_abi=3.12 conda-forge
24            pytz       2024.1                pytz=2024.1 conda-forge
25             six       1.16.0                 six=1.16.0 conda-forge
26             tbb    2021.12.0              tbb=2021.12.0 conda-forge
27              tk       8.6.13                  tk=8.6.13 conda-forge
28          tzdata        2024a               tzdata=2024a conda-forge
29            ucrt 10.0.22621.0          ucrt=10.0.22621.0 conda-forge
30              vc         14.3                    vc=14.3 conda-forge
31    vc14_runtime  14.40.33810   vc14_runtime=14.40.33810 conda-forge
32  vs2015_runtime  14.40.33810 vs2015_runtime=14.40.33810 conda-forge
33              xz        5.2.6                   xz=5.2.6 conda-forge

Footnotes

  1. https://lilicoding.github.io/papers/wang2020assessing.pdf

  2. https://rstudio.github.io/reticulate/articles/python_packages.html

  3. https://mamba.readthedocs.io/en/latest/index.html

  4. https://rstudio.github.io/reticulate/

  5. https://rstudio.github.io/reticulate/articles/calling_python.html

  6. https://rstudio.github.io/reticulate/articles/rstudio_ide.html

  7. https://rstudio.github.io/reticulate/articles/r_markdown.html

  8. https://quarto.org/docs/computations/python.html

  9. https://rstudio.github.io/reticulate/articles/calling_python.html#type-conversions

  10. https://rstudio.github.io/reticulate/articles/arrays.html

  11. https://rstudio.github.io/reticulate/articles/calling_python.html#sparse-matrices